Boosting the performance of pretrained CNN architecture on dermoscopic pigmented skin lesion classification

Abstract
Background: Pigmented skin lesions (PSLs) pose medical and esthetic challenges for those affected. PSLs can cause skin cancers, particularly melanoma, which can be life‐threatening. Detecting and treating melanoma early can reduce mortality rates. Dermoscopic imaging offers a noninvasive and cost‐effective technique for examining PSLs. However, the lack of standardized colors, image capture settings, and artifacts makes accurate analysis challenging. Computer‐aided diagnosis (CAD) using deep learning models, such as convolutional neural networks (CNNs), has shown promise by automatically extracting features from medical images. Nevertheless, enhancing the CNN models' performance remains challenging, notably concerning sensitivity.
Materials and methods: In this study, we aim to enhance the classification performance of selected pretrained CNNs. We use the 2019 ISIC dataset, which presents eight disease classes. To achieve this goal, two methods are applied: resolution of the dataset imbalance challenge through augmentation and optimization of the training hyperparameters via Bayesian tuning.
Results: The performance improvement was observed for all tested pretrained CNNs. The Inception‐V3 model achieved the best performance compared to similar results, with an accuracy of 96.40% and an AUC of 0.98.
Conclusion: According to the study, classification performance was significantly enhanced by augmentation and Bayesian hyperparameter tuning.

Dermoscopic images and biopsy images are commonly used to examine PSLs. 3,4 However, there are challenges in accurately interpreting these images, such as the presence of artefacts, variability within and between image classes, and subjectivity in reading by doctors.
In recent years, machine learning, particularly deep learning, has been used to aid in PSL classification. Deep learning has the advantage of being able to process raw data directly, without the need for extensive pre-processing methods. 9,20-23 Furthermore, the field of interpretable machine learning, or explainable artificial intelligence, is expanding to address ethical concerns in the healthcare industry. 24
This research uses pretrained CNNs for PSL classification with the ISIC-2019 dataset. The main contribution of this research is the use of augmentation to overcome dataset imbalance and hyper-parameter optimization to improve model performance. With these two treatments, the pretrained CNNs achieve satisfactory performance and exceed the existing results of similar studies. The contributions of this study are as follows:
1. We added preprocessing steps to improve the classification performance, such as data normalization, resizing, and augmentation.
2. We applied hyper-parameter optimization with Bayesian tuning to the learning rate and dropout parameters.
3. We implemented pretrained CNNs with ImageNet transfer learning and added a dropout layer before the last layer.
This paper is structured as follows. Section 1 covers the background, the importance of this research, and the contribution statement. Section 2 presents related works, followed by materials, methods, and experiment scenarios in Section 3. Section 4 presents the results and discussion, and Section 6 concludes.

RELATED WORK
The related research described in this section explicitly classifies PSLs using the eight-class ISIC 2019 dataset. Molina et al. 25 used DenseNet-201 with three classifiers to perform PSL classification with the ISIC-2019 dataset. They performed augmentation to address class imbalance. While this approach resulted in high accuracy and precision, it was less successful in improving sensitivity, an essential parameter in the medical field that should not be overlooked.
Meanwhile, Liu et al. 26 did not describe the methods and procedures used in their augmentation process, and the performance results obtained were unsatisfactory.

MATERIAL AND METHODS
This section describes the CNNs used in this study for the classification of PSLs. The study identifies the impact of applying augmentation and Bayesian tuning to pretrained CNN models on the ISIC 2019 dataset.
3. Models with the last variant in the family.

Experiment scenario
The experiments followed the flow chart shown in Figure 1. The results of tuning the learning rate and dropout values for each model are presented in Table 3.
The process runs on a device with 3.

Evaluation
Evaluating a learning algorithm on test data determines the algorithm's quality. The design of the evaluation metrics begins with the confusion matrix. The performance evaluation metrics commonly used in classification are sensitivity (SEN), specificity (SPE), accuracy (ACC), precision (PREC), and area under curve (AUC).
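As an illustration of how such metrics follow from a confusion matrix, the sketch below computes one-vs-rest SEN, SPE, PREC, ACC, and F1 for each class and then averages them. The one-vs-rest treatment, the macro averaging, and the small three-class matrix are assumptions made for the example; the paper does not state which averaging scheme it used, and AUC is omitted because it requires prediction scores rather than counts.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics for each class of a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # class k predicted as something else
        fp = sum(cm[i][k] for i in range(n)) - tp  # other classes predicted as k
        tn = total - tp - fn - fp
        sen = tp / (tp + fn) if tp + fn else 0.0
        spe = tn / (tn + fp) if tn + fp else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        acc = (tp + tn) / total if total else 0.0
        f1 = 2 * prec * sen / (prec + sen) if prec + sen else 0.0
        results.append({"SEN": sen, "SPE": spe, "PREC": prec, "ACC": acc, "F1": f1})
    return results


def macro_average(results):
    """Unweighted mean of each metric over all classes (an assumed scheme)."""
    return {key: sum(r[key] for r in results) / len(results) for key in results[0]}


# Hypothetical 3-class confusion matrix, not data from this study.
cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
metrics = per_class_metrics(cm)
summary = macro_average(metrics)
print(metrics[0])
print(summary)
```

Reporting sensitivity per class in this way makes it visible when a minority class is poorly detected even though overall accuracy looks high.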

Training models
This stage ascertains whether a model is overfitting, underfitting, or fitting well. Models or architectures that fit well will achieve good test results. The following observations refer to Figure 2:
• The models that were trained with augmentation performed better than the models that were not trained with augmentation. This suggests that augmentation can help prevent overfitting.
• The models that were trained with Bayesian tuning performed better than the models that were not trained with Bayesian tuning. This suggests that Bayesian tuning can help improve a model's generalization performance.

Testing models
Once model training has been completed, additional testing is conducted using the data set aside for testing. The test data comprises images that are selected from the beginning of each class, with an interval of 100 images. The sensitivity, specificity, precision, accuracy, F1 Score, and AUC values for each model are calculated from the raw data contained in the confusion matrix. The confusion matrices of four CNN models for PSL classification with three treatments are presented in Figure 3.
Table 4 compares the performance of four pretrained CNN models under three treatments: no augmentation, with augmentation, and with augmentation along with Bayesian tuning. The metrics used to evaluate the models are Sensitivity (SEN), Specificity, Precision, Accuracy, F1 Score (F1), and Area Under the Curve (AUC). Here is a more detailed analysis of the results:
• Without augmentation: In this section of the table, the models' performance is evaluated without any data augmentation. The metrics suggest that the models' overall performance is relatively lower across the board. The highest Accuracy is around 55.50% for Inception-v3, and the F1 Score ranges between approximately 22.86% and 54.10%.

• With augmentation: When data augmentation is applied, the models' performance improves significantly across all metrics. This indicates that data augmentation helps the models to better generalize and perform well on new, unseen data. The Accuracy values are notably higher, ranging from 62.00% to 88.63%, with F1 Scores between approximately 61.19% and 88.53%.
• With augmentation and Bayesian tuning: The third section of the table introduces Bayesian tuning in addition to augmentation. Bayesian tuning is a hyperparameter optimization technique that can further enhance model performance. As expected, this combination leads to even better performance across all metrics. The Accuracy values are now in the range of 91.13% to 96.38%, and F1 Scores are higher, ranging from around 91.08% to 96.29%.
Here is a more detailed analysis of the results for each model:
• Inception-v3: This model achieved the best performance on all metrics. Inception-v3 is a relatively large model, which may explain why it performed so well. However, it is also a more computationally expensive model to train and deploy.
• Xception: This model achieved the second best performance on all metrics. Xception is a relatively new model that is designed to be efficient and accurate. It is a good choice for applications where both of these factors are important.
• DenseNet-201: This model achieved the third best performance on all metrics. DenseNet-201 is the deepest model among the four compared, which necessitates extensive computational resources.
• MobileNet-v2: This model achieved the lowest performance on all metrics. However, it is also the smallest and most computationally efficient model of the four. MobileNet-v2 is a good choice for applications where computational resources are very limited.
This may be due to its architectural complexity, which may require more time for training convergence.
Molina et al. 25 address the problem of unbalanced classes by using three classifiers with linear plurality voting.Although this approach achieves high accuracy and precision, it fails to improve sensitivity, a crucial parameter in the medical field that cannot be overlooked.
Furthermore, this method requires extensive computational effort.
Liu et al. 26 give no specific explanation of the methods and procedures used in their augmentation process, and the performance results obtained were unsatisfactory.
Cauvery et al. 28 tackle the problem of unbalanced classes by using an online augmentation policy. Although this method has the advantage of not directly increasing the number of training images, it has numerous drawbacks, including dependence on an online connection, higher computational cost, dependence on the quality of the input data, and the risk of overfitting. Like the research of Kassem et al., 29 our study incorporates augmentation concepts where the number of images in each class is increased to approach the number of images in the largest class. However, our research shows several advantages, especially with respect to the data cleaning and splitting processes.
In particular, the test data are guaranteed to remain separate from the augmented training data. In Table 6, our model outperforms Kassem's model.
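The ordering described here, holding out test images first and augmenting only what remains, can be sketched as a small planning function. This is an illustrative sketch under stated assumptions: the class names and counts are hypothetical (they are not the ISIC-2019 figures), the function name `split_then_balance` is invented for the example, and every class is brought exactly to the size of the largest training class, whereas the text only says that class sizes approach it.

```python
import math


def split_then_balance(class_counts, test_per_class=100):
    """Plan a leakage-free split: reserve test images, then balance the rest.

    Returns, for each class, the test size, the original training size, the
    number of augmented images to generate, and the resulting balanced size.
    """
    train = {}
    for name, count in class_counts.items():
        if count <= test_per_class:
            raise ValueError(f"class {name!r} has too few images to hold out a test set")
        # Test images are removed BEFORE any augmentation takes place.
        train[name] = count - test_per_class

    target = max(train.values())  # size of the largest training class
    plan = {}
    for name, n_train in train.items():
        to_generate = target - n_train
        plan[name] = {
            "test": test_per_class,
            "train": n_train,
            "augmented": to_generate,
            # Upper bound on how many variants each original image must yield.
            "copies_per_image": math.ceil(to_generate / n_train),
            "balanced_train": n_train + to_generate,
        }
    return plan


# Hypothetical class counts, NOT the ISIC-2019 values reported in the paper.
counts = {"A": 1100, "B": 600, "C": 350}
plan = split_then_balance(counts)
for name, row in plan.items():
    print(name, row)
```

Because augmented images are derived only from the training portion, no transformed copy of a test image can appear in training.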

Limitations and future research
The limitation of this research is that it uses raw images from the dataset and only performs augmentation and balancing between classes. Preprocessing that improves the quality of the input images could further improve the performance of the model.
For further research, it is still possible to improve the performance of the model by adding layers or modules such as attention, dense layers, and pooling, or by combining multiple models into an ensemble. Of course, such additions will increase the training time and require optimization of the appropriate parameters. To meet medical implementation requirements, additional research on the model's interpretability is necessary.

CONCLUSIONS

convolution block and then fed to the fully connected block to generate predictions from the classification. A pretrained CNN is a model that has already been trained with a specialized dataset. For image classification, pretrained CNNs have usually been trained with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset, commonly called ImageNet (Russakovsky et al., 2015). ImageNet contains 1000 classes with 1 281 167 training images, 50 000 validation images, and 100 000 test images. A pretrained CNN model is considered capable when, trained with ImageNet, it achieves a top-one accuracy above 70% and a top-five accuracy above 90%. Many pretrained CNNs are currently available. Four pretrained CNN models were selected for this research, as presented in Table 2. The selection of pretrained CNN models is based on the following considerations: Models with fewer than 25 million parameters, due to available resources.

The ISIC-2019 dataset of 15 331 dermoscopic images in eight classes of abnormality categories was taken as input. Preprocessing consisted of duplicate removal, image size adjustment, and separation of 100 images per class as testing data. The remaining training data were then augmented so that the number of images per class was aligned. Training applied five-fold cross-validation. The initial parameter configuration was the same for all pretrained CNNs: ADAM as the optimization algorithm, a learning rate of 0.01, a dropout of 0.1, a batch size of 20, and 50 epochs. Parameter configuration was also done with Bayesian tuning for the learning rate and dropout values.

FIGURE 1 Experiment scenarios.

TABLE 3 Result of Bayesian hyper-parameter tuning.

ALGORITHM 1 Bayesian tuning (D, p, E_init, n, E_stop, T)
  Define D_subset ← Sample(D, p)
  Define D_train, D_val ← Split(D_subset)
  Define search space for hyperparameters
  Define objective function f() using validation accuracy
  Initialize E ← E_init, T ← 0
  while T < n do
    Train model with the current hyperparameters for E epochs
    Implement early stopping based on E_stop
    Update hyperparameters using Bayesian optimization
    Update E_opt using the best validation accuracy so far
    T ← T + 1
  end while
  Train with the optimal hyperparameters for E_opt epochs
  Save the model with the best hyperparameters
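The control flow of the tuning procedure in Algorithm 1 (subset sampling, train/validation split, a fixed number of trials, early stopping, and tracking of the best trial) can be sketched in plain Python. This is a structural sketch only, and two stand-ins are assumptions that should be read as such: `random_proposal` draws hyperparameters at random and is NOT a Bayesian optimizer, and `toy_validation_accuracy` is a synthetic curve in place of actual CNN training. The 80/20 split, the search ranges, and the values of `e_init` and `e_stop` are also illustrative; only p = 25% and n = 5 come from the text.

```python
import math
import random


def tune(dataset, p, e_init, n_trials, e_stop, propose, train_and_validate, seed=0):
    """Trial loop with early stopping; returns the best trial found."""
    rng = random.Random(seed)
    # D_subset <- Sample(D, p)
    subset = rng.sample(dataset, max(2, int(len(dataset) * p)))
    # D_train, D_val <- Split(D_subset); the 80/20 ratio is an assumption
    cut = int(len(subset) * 0.8)
    d_train, d_val = subset[:cut], subset[cut:]

    history = []  # (hyperparameters, best validation accuracy) per finished trial
    best = {"acc": -1.0, "params": None, "epochs": 0}
    for _ in range(n_trials):  # while T < n
        params = propose(history, rng)
        trial_acc, trial_epoch, waited = -1.0, 0, 0
        for epoch in range(1, e_init + 1):
            acc = train_and_validate(params, d_train, d_val, epoch)
            if acc > trial_acc:
                trial_acc, trial_epoch, waited = acc, epoch, 0
            else:
                waited += 1
                if waited >= e_stop:  # early stopping after e_stop stale epochs
                    break
        history.append((params, trial_acc))
        if trial_acc > best["acc"]:  # keep the best hyperparameters and E_opt
            best = {"acc": trial_acc, "params": params, "epochs": trial_epoch}
    return best


def random_proposal(history, rng):
    """Placeholder proposal step: ignores history, so it is random search."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -2),
        "dropout": rng.uniform(0.1, 0.5),
    }


def toy_validation_accuracy(params, d_train, d_val, epoch):
    """Synthetic accuracy curve standing in for real training and validation."""
    lr_term = abs(math.log10(params["learning_rate"]) + 3)  # 0 at lr = 1e-3
    dropout_term = abs(params["dropout"] - 0.3)
    ceiling = max(0.0, 0.95 - 0.1 * lr_term - 0.2 * dropout_term)
    return ceiling * (1 - 0.5 ** epoch)


best = tune(
    dataset=list(range(400)),  # stand-in for image identifiers
    p=0.25,
    e_init=10,
    n_trials=5,
    e_stop=3,
    propose=random_proposal,
    train_and_validate=toy_validation_accuracy,
)
print(best)
```

In a real setup, `propose` would be replaced by a Bayesian optimizer that uses `history` to choose the next candidate, and the final step of the procedure (retraining on the full data with the best hyperparameters for the best epoch count) would follow the call to `tune`.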
CNNs present a significant opportunity to address classification problems related to PSLs. The present study evaluated several pretrained models' ability to classify PSL diseases. In this work, the ISIC 2019 dataset with eight classification classes was used as input for four CNN architectures, namely Inception-V3, DenseNet-201, Xception, and MobileNet-v2. The study evaluates the performance of these models.

FIGURE 2 Training and validation accuracy of four models. The vertical axis depicts the accuracy value, while the horizontal axis depicts the number of epochs. A good model results from the validation accuracy pattern that follows the training accuracy pattern.

Figure 2 shows the training and validation accuracy for four models: Inception-V3, DenseNet-201, Xception, and MobileNet-V2. The models were trained with three different treatments: no augmentation, with augmentation, and with augmentation along with Bayesian tuning. In a well-fitted model, the validation accuracy pattern follows the training accuracy pattern. Here is our analysis of the graph:
• Inception-V3 has the highest training accuracy, but its validation accuracy plateaus after 10 epochs. This indicates that the model does not overfit the training data.
• DenseNet-201 has a lower training accuracy than Inception-V3, but its validation accuracy continues to increase after 10 epochs. This indicates that the model does not overfit the training data.
• Xception has a similar training accuracy to DenseNet-201, but its validation accuracy begins to decline after 15 epochs. This indicates that the model begins to overfit.
• MobileNet-V2 has the lowest training accuracy of the four models, but its validation accuracy continues to increase throughout the training process. This indicates that the model is not overfitting the training data.
All four models demonstrate satisfactory performance during training. Overfitting remains minimal, particularly for models that are augmented and tuned using Bayesian methods. The models' performance in the testing phase will confirm this conclusion. Here are some additional observations about the graph:

FIGURE 3 Three conditions, namely without augmentation, with augmentation, and with augmentation as well as Bayesian tuning, were used.
• The findings indicate that augmentation is a crucial technique to enhance the performance of CNN models for the classification of skin lesions. This is due to the fact that augmentation mitigates overfitting and enhances the models' generalization ability.
• Another technique that can be applied to improve the performance of CNN models for skin lesion classification is Bayesian tuning. Bayesian tuning helps in identifying the optimal hyperparameters of the models, resulting in improved accuracy.
• The results suggest that Inception-V3 and DenseNet-201 are better equipped to handle overfitting than Xception and MobileNet-V2. This is probably because Inception-V3 and DenseNet-201 have deeper architectures.

Table 4 demonstrates the clear advantages of using data augmentation and Bayesian tuning to enhance the performance of pretrained models. These techniques help the models generalize better, resulting in higher accuracy and better overall performance in classification tasks. Adding augmentation to the training process significantly improves the models' ability to generalize and make accurate predictions on new data. The combination of augmentation and Bayesian tuning further refines the models' performance, indicating that the models' hyperparameters are optimized to better fit the data. In terms of overall performance, Inception-v3 consistently achieves the highest scores across all treatments, followed by DenseNet-201, Xception, and then MobileNet-v2.

TABLE 5 Training duration of models.

Table 5 shows the duration of all models during training. The training time for all models decreases with data augmentation. This is surprising because data augmentation usually leads to a larger dataset, which consequently may result in longer training times. However, the augmented data may contribute to a faster convergence of the models due to increased diversity. When both data augmentation and Bayesian tuning are applied, the training time of Inception-v3 and MobileNet-v2 decreases significantly. This indicates that these models benefit from hyperparameter optimization and augmented data more than the other models. In all scenarios, Xception has the longest training times.
The present study conducted an extensive evaluation of pretrained CNN models for the classification of PSLs, employing the ISIC 2019 dataset. The results highlighted the significance of data augmentation and Bayesian tuning techniques in enhancing the performance and generalizability of the models. Both Inception-V3 and DenseNet-201 consistently outperformed other models. This could be attributed to the effects of data augmentation and Bayesian tuning. Data augmentation helped prevent overfitting, and Bayesian tuning fine-tuned the hyperparameters of the models. Our models demonstrate promising outcomes, exceeding the accuracy and other metrics of various prior studies. However, we noted certain limitations, such as the use of raw images and the potential for further improvements through additional layers. Future research opportunities involve improving input quality through preprocessing and investigating advanced architectures. Our research establishes the foundation for accurate skin lesion classification using CNNs, which may provide potential for improved medical diagnosis and treatment.
Cauvery et al. 28 addressed the problem of unbalanced classes by using an online augmentation policy. Although this method has the indirect advantage of increasing the number of training images, it has several disadvantages, including dependence on an online connection, higher computational cost, dependence on input data quality, and risk of overfitting. Like the work of Kassem et al., 29 our study also incor-
Filipescu et al. 27 employed a pretrained VGG-16 CNN to classify PSLs in the ISIC-2019 dataset, comprising eight classes. The researchers conducted preprocessing by resizing images to match the pretrained model and optimizing hyper-parameters. The accuracy achieved was 78.11%.

Class Initial data No ID lesion Testing dataset Training dataset Augmented training dataset
This dataset was used for the 2019 International Skin Imaging Collaboration competition. It was chosen because it has many images and, at present, the most classes. The data contain dermoscopy-captured images and metadata. The ISIC 2019 dataset comprises eight disease classes with an imbalanced distribution; almost half of the images are melanocytic nevus (NV).

Pretrained CNN Top-one accuracy (%) Top-five accuracy (%) Depth Parameter (million) Input size
neural layers as classifiers. Information is formed or extracted by the
1. the model without data augmentation,
2. the model with data augmentation,
3. the model with data augmentation and Bayesian tuning.
Algorithm 1 presents the Bayesian tuning procedure implemented in this study. The input variables required in this procedure are the dataset D, the percentage p of the dataset used as a subset, the number of initial epochs E_init, the number of trials n, and the number of stopping epochs E_stop. This research uses 25% of the augmented dataset as training data for Bayesian tuning, with n = 5 trials and validation accuracy as the objective function.

TABLE 4 Comparison of performance of four pretrained CNN models under three treatments: no augmentation, augmentation, and augmentation with Bayesian tuning.
• The findings indicate that augmentation is a crucial technique to
TABLE 6 Comparison of performance with existing research. Bolded values indicate the highest values.
This section compares the results of this research with previous research. The comparison is limited to classification research using the eight-class ISIC-2019 dataset, as shown in Table 6. With augmentation and Bayesian tuning, the four pretrained CNNs in this study outperform almost all existing research results; the exception is accuracy, where they are still surpassed by Molina et al.